In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import dateutil.parser
from dateutil import parser
from datetime import datetime

First, I made a mistake naming the data set! It's 2015 data, not 2014 data. But yes, still use 311-2014.csv. You can rename it.

Importing and preparing your data

Import your data, but only the first 200,000 rows. You'll also want to change the index to be a datetime based on the Created Date column - you'll want to check if it's already a datetime, and parse it if not.


In [30]:
complaint_df = pd.read_csv("311-2014.csv", parse_dates = True).head(200000)


/Users/Monica/.virtualenvs/dataanalysis/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (8,17,40,41,42,43,44,45,46,47,48,49) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

In [32]:
complaint_df.dtypes


Out[32]:
Unique Key                          int64
Created Date                       object
Closed Date                        object
Agency                             object
Agency Name                        object
Complaint Type                     object
Descriptor                         object
Location Type                      object
Incident Zip                       object
Incident Address                   object
Street Name                        object
Cross Street 1                     object
Cross Street 2                     object
Intersection Street 1              object
Intersection Street 2              object
Address Type                       object
City                               object
Landmark                           object
Facility Type                      object
Status                             object
Due Date                           object
Resolution Description             object
Resolution Action Updated Date     object
Community Board                    object
Borough                            object
X Coordinate (State Plane)        float64
Y Coordinate (State Plane)        float64
Park Facility Name                 object
Park Borough                       object
School Name                        object
School Number                      object
School Region                      object
School Code                        object
School Phone Number                object
School Address                     object
School City                        object
School State                       object
School Zip                         object
School Not Found                   object
School or Citywide Complaint      float64
Vehicle Type                       object
Taxi Company Borough               object
Taxi Pick Up Location              object
Bridge Highway Name                object
Bridge Highway Direction           object
Road Ramp                          object
Bridge Highway Segment             object
Garage Lot Name                    object
Ferry Direction                    object
Ferry Terminal Name                object
Latitude                          float64
Longitude                         float64
Location                           object
dtype: object

In [8]:
import datetime
# complaint_df['Created Date'] = complaint_df['Created Date'].astype(dateutil.parser.parse)

In [34]:
complaint_df['Created Date'] = complaint_df['Created Date'].apply(dateutil.parser.parse)

In [36]:
complaint_df['Created Date'].describe()


Out[36]:
count                  200000
unique                 177667
top       2015-05-04 00:00:00
freq                     1034
first     2015-01-01 00:08:34
last      2016-01-01 00:00:00
Name: Created Date, dtype: object

In [38]:
complaint_df['Created Date'].head()


Out[38]:
0   2015-07-06 10:58:27
1   2015-07-03 13:26:29
2   2015-11-09 03:55:09
3   2015-07-03 02:18:32
4   2015-07-04 00:03:27
Name: Created Date, dtype: datetime64[ns]

In [40]:
#import datetime

#dt = datetime.datetime.strptime(string_date, fmt)

In [42]:
#import numpy as np
# Build a function using that method
#str_time = "07/06/2015 10:58:27 AM"
#def time_to_datetime(str_time):
 #   try:
   #     dt = parser.parse(str_time)
  #      return dt
   # except:
    #    return np.nan
#time_to_datetime(str_time)

In [44]:
# complaint_df['Created Date'] = complaint_df['Created Date'].apply(time_to_datetime)

In [46]:
complaint_df['Created Date'].describe()


Out[46]:
count                  200000
unique                 177667
top       2015-05-04 00:00:00
freq                     1034
first     2015-01-01 00:08:34
last      2016-01-01 00:00:00
Name: Created Date, dtype: object

What was the most popular type of complaint, and how many times was it filed?


In [ ]:
complaint_df.groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(1)

Make a horizontal bar graph of the top 5 most frequent complaint types.


In [48]:
import matplotlib.pyplot as plt
%matplotlib inline
freqNYcomplaints = complaint_df.groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(5).plot(kind="bar", color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'])
freqNYcomplaints.set_title('Most Frequent NYC Complaints')
freqNYcomplaints.set_xlabel('Type of Complaint')
freqNYcomplaints.set_ylabel('Frequency of Complaint')
plt.savefig('freqNYcomplaints.png')


Which borough has the most complaints per capita? Since it's only 5 boroughs, you can do the math manually.


In [ ]:
complaint_df.head()

According to your selection of data, how many cases were filed in March? How about May?


In [56]:
#confirm this is a date type 
# complaint_df['Created Date']

In [ ]:


In [55]:
complaint_df.index = complaint_df['Created Date']
# complaint_df.head()

In [199]:
#complaint_df = complaint_df.set_index(['Created Date'])
#complaint_df['Created Date'] = pd.to_datetime(complaint_df['Created Date'])
ax = complaint_df['2015-03'].resample('D').count().plot(y='Unique Key')
# complaint_df['2015-04']['Complaint Type'].count()
ax.set_title('NYC Complaints in March 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MarchfreqNYcomplaints.png')



In [200]:
ax = complaint_df['2015-05'].resample('D').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in May 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MayfreqNYcomplaints.png')



In [58]:
complaint_df.head()


Out[58]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address ... Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
Created Date
2015-07-06 10:58:27 31015465 2015-07-06 10:58:27 07/22/2015 01:07:20 AM DCA Department of Consumer Affairs Consumer Complaint Demand for Cash NaN 11360 27-16 203 STREET ... NaN NaN NaN NaN NaN NaN NaN 40.773540 -73.788237 (40.773539552542, -73.78823697228408)
2015-07-03 13:26:29 30997660 2015-07-03 13:26:29 07/03/2015 02:08:20 PM NYPD New York City Police Department Vending In Prohibited Area Residential Building/House 10019 200 CENTRAL PARK SOUTH ... NaN NaN NaN NaN NaN NaN NaN 40.767021 -73.979448 (40.76702142171206, -73.97944780718524)
2015-11-09 03:55:09 31950223 2015-11-09 03:55:09 11/09/2015 08:08:57 AM NYPD New York City Police Department Blocked Driveway No Access Street/Sidewalk 10453 1993 GRAND AVENUE ... NaN NaN NaN NaN NaN NaN NaN 40.852671 -73.910608 (40.85267061877697, -73.91060771362552)
2015-07-03 02:18:32 31000038 2015-07-03 02:18:32 07/03/2015 07:54:48 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 11372 84-16 NORTHERN BOULEVARD ... NaN NaN NaN NaN NaN NaN NaN 40.755774 -73.883262 (40.755773786469966, -73.88326243225418)
2015-07-04 00:03:27 30995614 2015-07-04 00:03:27 07/04/2015 03:33:09 AM NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11216 1057 BERGEN STREET ... NaN NaN NaN NaN NaN NaN NaN 40.676175 -73.951269 (40.67617516102934, -73.9512690004692)

5 rows × 53 columns

I'd like to see all of the 311 complaints called in on April 1st.

Surprise! We couldn't do this in class, but it was just a limitation of our data set


In [67]:
print("There were", complaint_df['2015-04-01']['Unique Key'].count(), "complaints on April 1st.")


There were 573 complaints on April 1st.

What was the most popular type of complaint on April 1st?

What were the most popular three types of complaint on April 1st


In [71]:
complaint_df['2015-04-01'].groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(1)


Out[71]:
Complaint Type
Illegal Parking    67
Name: Complaint Type, dtype: int64

In [74]:
complaint_df['2015-04-01'].groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(3)


Out[74]:
Complaint Type
Illegal Parking     67
Street Condition    64
Blocked Driveway    58
Name: Complaint Type, dtype: int64

What month has the most reports filed? How many? Graph it.


In [77]:
complaint_df.resample('M')['Complaint Type'].count().sort_values(ascending=False).head(1)


Out[77]:
Created Date
2015-05-31    49715
Name: Complaint Type, dtype: int64

In [201]:
ax = complaint_df['2015-05'].resample('D').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in May 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MayfreqNYcomplaints.png')


What week of the year has the most reports filed? How many? Graph the weekly complaints.


In [97]:
complaint_df.resample('W')['Complaint Type'].count().sort_values(ascending=False).head(3)


Out[97]:
Created Date
2015-05-10    13559
2015-05-17    11683
2015-05-24    10351
Name: Complaint Type, dtype: int64

In [102]:
complaint_df['2015-05-10'].resample('W')['Complaint Type'].count().plot(y='Unique Key')


/Users/Monica/.virtualenvs/dataanalysis/lib/python3.5/site-packages/matplotlib/axes/_base.py:2787: UserWarning: Attempting to set identical left==right results
in singular transformations; automatically expanding.
left=2367.0, right=2367.0
  'left=%s, right=%s') % (left, right))
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x11cb0fef0>

Noise complaints are a big deal. Use .str.contains to select noise complaints, and make an chart of when they show up annually. Then make a chart about when they show up every day (cyclic).


In [202]:
ax = complaint_df['Complaint Type'].str.contains('Noise').resample('W').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in 2015')
ax.set_xlabel('Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('AnnualfreqNYcomplaints.png')



In [210]:
ax = complaint_df['Complaint Type'].str.contains('Noise').groupby(by=complaint_df.index.hour).mean().plot(y='Unique Key')

ax.set_title('Daily NYC SoundComplaints in 2015')
ax.set_xlabel('Number of Complaints')
plt.savefig('DailyfreqNYSoundcomplaints.png')

#ax.set_xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])
#ax.set_xticklabels([12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])



In [ ]:

Which were the top five days of the year for filing complaints? How many on each of those days? Graph it.


In [124]:
complaint_df.resample('W')['Complaint Type'].count().sort_values(ascending=False).head(5)


Out[124]:
Created Date
2015-05-10    13559
2015-05-17    11683
2015-05-24    10351
2015-05-03    10184
2015-05-31     9387
Name: Complaint Type, dtype: int64

In [176]:
#may_17 = complaint_df['2015-05-17']
#ax = may_17.groupby(by=may_17.index.hour).count().plot(y='Unique Key')

In [183]:
may_17 = complaint_df['2015-05-17']
ax = may_17.groupby(by=may_17.index.hour).count().plot(y='Unique Key', label="May 17th")
may_24 = complaint_df['2015-05-24']
may_24.groupby(by=may_24.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 24th")
may_03 = complaint_df['2015-05-03']
may_03.groupby(by=may_03.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 3rd")
may_10 = complaint_df['2015-05-10']
may_10.groupby(by=may_10.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 10th")
may_31 = complaint_df['2015-05-31']
may_31.groupby(by=may_31.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 31st")
ax.set_xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])
ax.set_xticklabels([12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])


Out[183]:
[<matplotlib.text.Text at 0x17332efd0>,
 <matplotlib.text.Text at 0x173333080>,
 <matplotlib.text.Text at 0x173364080>,
 <matplotlib.text.Text at 0x17335cd68>,
 <matplotlib.text.Text at 0x173369320>,
 <matplotlib.text.Text at 0x173369d68>,
 <matplotlib.text.Text at 0x1733746a0>,
 <matplotlib.text.Text at 0x173369908>,
 <matplotlib.text.Text at 0x173355978>,
 <matplotlib.text.Text at 0x17338d2b0>,
 <matplotlib.text.Text at 0x17335c5f8>,
 <matplotlib.text.Text at 0x1733170b8>,
 <matplotlib.text.Text at 0x173322ef0>,
 <matplotlib.text.Text at 0x173325160>,
 <matplotlib.text.Text at 0x1733a9748>,
 <matplotlib.text.Text at 0x1733a9f60>,
 <matplotlib.text.Text at 0x1733ad7b8>,
 <matplotlib.text.Text at 0x1733ae1d0>,
 <matplotlib.text.Text at 0x1733aec18>,
 <matplotlib.text.Text at 0x1733b26a0>,
 <matplotlib.text.Text at 0x1733b5128>,
 <matplotlib.text.Text at 0x1733b5b70>,
 <matplotlib.text.Text at 0x1733ba5f8>,
 <matplotlib.text.Text at 0x1733bc080>]

In [174]:
ax


Out[174]:
<matplotlib.axes._subplots.AxesSubplot at 0x171ae2198>

In [182]:
# ax = complaint_df['2015-05-10'].resample('H').count().plot(y='Unique Key')
#ax = complaint_df['2015-05-17'].groupby(by=complaint_df.index.hour).count().plot(y='Unique Key', label = "May 17th")
# complaint_df['2015-05-24'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 24th")
# complaint_df['2015-05-03'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 3rd")
# complaint_df['2015-05-31'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 31st")
# # the least sold in a month (e.g., the worst January ever , the worst July ever, etc)
# complaint_df['2015-05-17'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = 'May 17th')

# the most sold in a month (e.g., the best January ever , the best July ever, etc)
#df.groupby(by=df.index.month).max().plot(y='val', ax=ax, label = 'Most Sold')

What hour of the day are the most complaints? Graph a day of complaints.


In [197]:
#make a graph of the average 
ax = complaint_df.groupby(by=complaint_df.index.hour).mean().plot(y='Unique Key', label='Average')
#x_values = complaint_df.groupby(by=complaint_df.index.hour).median().index
# get the minimum
#min_values = complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].min()
#get maxs
#max_values = complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].max()
#ax.fill_between(x_values, min_values, max_values, alpha=0.3)
#ax.set_title("Fluctuation of House Sales")



In [192]:
complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].median()


Out[192]:
0     30614734
1     30867364
2     30793150
3     30785415
4     31058530
5     30853700
6     31021120
7     30812242
8     30782634
9     30776134
10    30766641
11    30774043
12    30780706
13    30775608
14    30766844
15    30747919
16    30746354
17    30800853
18    30752702
19    30804272
20    30945024
21    30924831
22    30893587
23    30857818
Name: Unique Key, dtype: int64

One of the hours has an odd number of complaints. What are the most common complaints at that hour, and what are the most common complaints the hour before and after?


In [214]:
ax = complaint_df['Complaint Type'].str.contains('04:').resample('T').count().plot(y='Unique Key')



In [ ]:


In [ ]:

So odd. What's the per-minute breakdown of complaints between 12am and 1am? You don't need to include 1am.


In [ ]:

Looks like midnight is a little bit of an outlier. Why might that be? Take the 5 most common agencies and graph the times they file reports at (all day, not just midnight).


In [ ]:

Graph those same agencies on an annual basis - make it weekly. When do people like to complain? When does the NYPD have an odd number of complaints?


In [ ]:

Maybe the NYPD deals with different issues at different times? Check the most popular complaints in July and August vs the month of May. Also check the most common complaints for the Housing Preservation Bureau (HPD) in winter vs. summer.


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: